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Abstract. The advent of high-throughput sequencing technologies con- 
stituted a major advance in genomic studies, offering new prospects in a 
wide range of applications. We propose a rigorous and flexible algorithmic 
solution to mapping SOLiD color-space reads to a reference genome. The 
solution relies on an advanced method of seed design that uses a faith- 
ful probabilistic model of read matches and, on the other hand, a novel 
seeding principle especially adapted to read mapping. Our method can 
handle both lossy and lossless frameworks and is able to distinguish, at 
the level of seed design, between SNPs and reading errors. We illustrate 
our approach by several seed designs and demonstrate their efficiency. 

1 Introduction 

High-throughput sequencing technologies can produce hundreds of mil- 
hons of DNA sequence reads in a single run, providing faster and less 
expensive solutions to a wide range of genomic problems. Among them, 
the popular SOLiD system (Applied Biosystems) features a 2-base en- 
coding of reads, with an error-correcting capability helping to reduce the 
error rate and to better distinguish between sequencing errors and SNPs. 

In this paper, we propose a rigorous and flexible algorithmic approach 
to mapping SOLiD color-space reads to a reference genome, capable to 
take into account various external parameters as well as intrinsic prop- 
erties of reads resulting from the SOLiD technology. The flexibility and 
power of our approach comes from an advanced use of spaced seeds [1,2]. 

The main novelty of our method is an advanced seed design based 
on a faithful probabilistic model of SOLiD read alignments incorporating 
reading errors, SNPs and base indels, and, on the other hand, on a new 
seeding principle especially adapted for read mapping. The latter relies 
on the use of a small number of seeds (in practice, typically two) designed 
simultaneously with a set of position on the read where they can hit. We 
call this principle position-restricted seeds. Advantageously, it allows us 
to take into account, in a subtle way, read properties such as a non- 
uniform distribution of reading errors along the read, or a tendency of 
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reading errors to occur periodically at a distance of 5 positions, which are 
observed artifacts of the SOLID technology. 

A number of algorithms and associated software programs for read 
mapping have been recently published. Several of them such as MAQ [3], 
MOSAIK [4], MPSCAN [5] PASS [6], PerM [7], RazerS [8], SHRIMP [9] 
or ZOOM [10] apply contiguous or spaced seeding techniques, requir- 
ing one or several hits per read. Other programs approach the problem 
differently, e.g., by using the Burrows- Wheeler transform (Bowtie [11], 
BWA [12], S0AP2 [13]), suffix arrays (segemehl [14], BFAST [15]), vari- 
ations of the Rabin-Karp algorithm (SOCS [16]) or a non-deterministic 
automata matching algorithm on a keyword tree of the search strings 
(PatMaN [17]). Some tools, such as segemehl [14] or Eland [18], are de- 
signed for 454 and lUumina reads and thus do not deal with the charac- 
teristics of the SOLID encoding which is the subject of this paper. Also, 
it should be noted that, in many cases, sensitivity is sacrificed in favor of 
speed: most methods find similarities up to a small number of mismatches, 
and few approaches account for nucleotide insertions and deletions. 

Seed-based methods for read mapping use different seeding strategies. 
SHRIMP [9] uses spaced seeds that can hit at any position of the read and 
introduces a lower bound on the number of hits within one read. MAQ [3] 
uses six light-weight seeds allowed to hit in the initial part of the read. 
ZOOM [10] proposes to use a small number (4-6) of spaced seeds each 
applying at a fixed position, to ensure a lossless search with respect to a 
given number of mismatches. In the lossless framework, PerM [7] proposes 
to use "periodic seeds" (see also [19]) to save on the index size. 

Despite the number of proposed solutions, none of them relies on 
a systematic seed design method taking into account (other than very 
empirically) statistical properties of reads. In this paper, we present a 
seed design based on Hidden Markov models of read matches, using a 
formal finite automata-based approach previously developed in [20]. To 
the best of our knowledge, this is the first time that the seed design for 
read mapping is done based on a rigorous probabilistic modeling. 

Our approach allows us to design seeds in both lossy and lossless 
frameworks. In the lossless framework, where the goal is to detect all 
read occurrences within a specified number of mismatches, we have the 
flexibility of partitioning this number into reading errors and SNPs. 

As a result, we obtain a very efficient mapping algorithm combining a 
small number of seeds and therefore a reasonable amount of index memory 
with guaranteed sensitivity and small running time, due to a restricted 
subset of positions where seeds should be applied. 



2 AB SOLiD reads: encoding and technological artifacts 



The SOLiD System [21] enables massively parallel sequencing of clon- 
ally amplified DNA fragments. This sequencing technology is based on 
sequential ligation of dye-labeled oligonucleotide probes, each probe as- 
saying two base positions at a time. The system uses four fluorescent dyes 
to encode for the sixteen possible 2-base combinations. Consequently, a 
DNA fragment is represented by the initial base followed by a sequence of 
overlapping dimers, each encoded with one of four colors using a degen- 
erate coding scheme that satisfies several rules. Thus, although a single 
color in a read can represent any of four dimers, the overlapping proper- 
ties of the dimers and the nature of the color code eliminate ambiguities 
and allow for error-correcting properties. 

As our work relies on modeling the error distribution along the reads, 
we are particularly interested in several aspects of the sequencing tech- 
nology that influence this distribution. First, since every color of the read 
encodes two adjacent bases and therefore every base affects two adjacent 
colors, it follows that any single base mutation results in the change of 
two adjacent colors in the read. On the other hand, since cycles of five 
di-nucleotide readings are performed in order to retrieve the sequence 
(as described in the documentation of Applied Biosystems [21,22]), we 
expect reading error bias to appear with a periodicity of 5. 

To confirm this intuition, we studied the variation of the reading error 
probability along the read by analyzing statistical properties of about a 
million of SOLiD reads of the S. cerevisiae genome. In this analysis, we 
used the qualities Qi associated to each position / on the read, which 
relate to the error probability through Qi = —10 ■ logio (Pe) [23]. 
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We computed the quality correlation between read positions depend- 
ing on the distance between them. Formally, if m is the read length, 
then for each iG{l,..,m — 1}, we computed the correlation through the 
following standard formula c{i) = -^(('■^^"QKQj+i-Q)) ^ -^jigj-e E{-) is the 

expectation, Q the average quality along the read, and aq the standard 
deviation of quality values. The result is given in Figure 1. It shows signif- 
icantly higher correlations (up to 0.63) between pairs of positions located 
at distances that are multiples of 5. 

Additionally, we studied the behavior of reading error probability val- 
ues along the read. As shown in Figure 2, the error probability tends to 
increase towards the end of the read, making the last positions of the 
color sequence less reliable when searching for similarities. 

3 Seed design for mapping SOLiD reads 
3.1 Seed design: background 

Spaced seeds, first proposed in the context of DNA sequence alignment by 
the PatternHunter algorithm [1], represent a powerful tool for enhancing 
the efficiency of the sequence search. 

Using a spaced seed instead of a contiguous stretch of identical nu- 
cleotides to select a potential similarity region can improve the sensitivity 
of the search for a given selectivity level [1]. Furthermore, using a seed 
family, i.e. several seeds simultaneously instead of a single seed, further 
improves the sensibility/selectivity trade-off [24,25]. The price for using 
seed families is the necessity to store in memory several indexes, one for 
each seed. In practice, however, using in the search a small number of 
seeds can significantly improve the sensitivity /selectivity ratio. 

A crucial feature of spaced seeds is their capacity to be adapted to 
different search situations. Spaced seeds can be designed to capture statis- 
tical properties of sequences to be searched. For example, [26, 27] report 
on designing spaced seeds adapted to the search of coding regions. One 
of the contributions of this paper is a rigorous design of seeds adapted 
to mapping genomic reads issued from the SOLiD technology. Note that 
here we will work with regular spaced seeds rather than more advanced 
subset seeds [20,27,28], as there is very little or no information in dis- 
criminating among different classes of mismatches that can be used to 
our advantage. 

One has to distinguish between the lossy and lossless cases of seed- 
based search. In the lossy case we are allowed to miss a fraction of target 



matches, and the usual goal of seed design is to maximize the sensitivity 
over a class of seeds verifying a certain selectivity level. In the lossless 
case we must detect all matches verifying a given dissimilarity threshold 
(expressed in terms of a number of errors or a minimal score), and the 
goal of seed design is to compute a minimal set of seeds with the best 
selectivity that still ensures the lossless search. In the context of read 
mapping for high-throughput sequencing technologies, both lossy [9,3] 
and lossless [10, 7] frameworks have been used. 

Our approach to seed design relies on a methodology proposed in 
our previous work [20], based on the finite automata theory. A central 
idea is to model the set of target alignments by a finite-state probability 
transducer, which subsumes the Hidden Markov Model commonly used in 
biosequence analysis. On the other hand, a seed, or a seed family, is mod- 
eled by a seed automaton for which we proposed an efficient compact con- 
struction [29]. Once these two automata have been specified, computing 
the seed sensitivity can be done efficiently with a dynamic programming 
algorithm as described in [20]. The seed design is then done by applying 
our Iedera software [20, 29, 30] that uses the above algorithm to explore 
the space of possible seeds and select most sensitive seeds using a sam- 
pling procedure for seeds and respective hit positions and by performing 
a local optimization on the best candidates. 

Here we apply this methodology to seed design for mapping SOLID 
reads, both in the lossy and lossless frameworks. Besides, we introduce an 
important novelty in the definition of seeds, especially advantageous for 
mapping short reads: position-restricted seeds, which are seeds designed 
together with the set of positions on the read where they can be applied. 
This can be seen as an intermediate paradigm between applying each 
seed at every position and the framework of [10] where each seed applies 
to a designated position of the read. Position-restricted seeds offer an 
additional power of capturing certain read properties (such as, e.g., an 
increasing error level towards the end of the read) in a flexible way, with- 
out sacrificing the selectivity and thus the speed of the seeding procedure. 

3.2 Modeling seeds and SOLID reads by finite automata 

We now present our model of color sequence alignments, built on the 
observations of Section 2. Note that we consider the reference genome 
translated into the color alphabet, i.e. both the reads and the genome are 
represented in color space. 



Position-restricted seeds. As shown in Section 2, the reading error 
probability increases towards the end of the read, implying that a search 
for similarity within the last positions of the read could lead to erroneous 
results or no results at all. Hence, wc can improve the seed selectivity 
by favoring hits at initial positions of the read where matches are more 
likely to be significant. We then define each seed tt jointly with a set of 
positions P to which it is applied on the read. 

We use the framework of [20] where a seed tt is represented by a 
deterministic finite automaton Q over the alignment alphabet A which 
is here the binary match/mismatch alphabet. Note that the size of Q is 
a crucial parameter in the algorithm of [20] to compute the sensitivity of 
the seed. An efficient construction of such an automaton has been studied 
in [29]: it has the optimal size of (w + 1)2*""' states, where s and w are 
respectively the span (length) and weight (number of match symbols) of 
the seed. 

Let m be the read size. To take into account the set of allowed posi- 
tions, we compute the product of Q with an automaton \p consisting of 
a linear chain of ttt. + 1 states qq, qi, . . . , Qmi 

where qq is the initial state, 
and for every g^, both outgoing transitions lead to Qi+i. Final states of 
the automaton reflect the set of possible positions P where the seed is 
allowed to hit: a state Qi is final iff z — s G P. 

A trivial upper bound on the size of the product automaton for a 
spaced seed of span s and weight w \s {w + 1) ■ 2*""^ • m. This bound 
can be improved using the notion of matching prefix, as explained in [29] . 
Thus, an economical implementation of the product of Q by A taking 
into account the set of matching positions P always produces at most 
{w + 1) ■ 2''-"' • |P| + m states. 

Furthermore, consider an interval graph of the possible placements of 
the seed on the read, where each placement spans over an interval of s 
positions. The chromatic number c of this graph can be easily computed, 
providing the maximal number of overlapping seeds. We observe that if 
this number is small (compared to (s — w + log{w))), then the size of the 
product automaton is bounded by 0{{m + 1) • 2'^). 

Model for SNPs and reading errors As explained in Section 2, there 
are two independent sources of errors in reads with respect to the refer- 
ence genome: reading errors and SNPs/indels, i.e., bona fide differences 
between the reference genome and sequenced data. We represent each of 
these sources by a separate Hidden Markov Model (viewed as a proba- 



bilistic transducer, see [20]), combined in a model which aUows ah error 
types to be cumulated in the resulting sequences. 

The SNP/Indel model, denoted Mgj^p/j, (Figure 3) has three states: 
Match, SNP and Indel, referring to matches, mismatches, and indels at 
the nucleotide level, and is parametrized by SNP and Indel occurrence 
probabilities, denoted psNP and pindel- Each transition of Msnp/i gen- 
erates a color match, mismatch or indel, with probabilities p^, p%, and pi 
respectively, defined as follows. An insertion or deletion of n nucleotides 
appears at the color level as an insertion/deletion of n colors preceded 
in 3/4 cases by a color mismatch [21]. Hence, the = 0.75 when enter- 
ing the Indel state, and = 1 for any transition having the Indel state 
as source. A nucleotide mutation is reflected in the color encoding by a 
change of two adjacent colors (and, more generally, n consecutive muta- 
tions affect n + 1 consecutive colors [21]). Thus, p% = I when entering or 
leaving the SNP state, and a color match/mismatch mixture when stay- 
ing in the mismatch state, since color matches may occur inside stretches 
of consecutive SNPs. Finally, = 1 when looping on the M state. 

The reading errors are handled by a more complex model, denoted 
Mre (Figure 4). Basically, it is composed of several submodels, one for 
each possible arrangement of reading errors on a cycle of 5 positions. 
Within these submodels, the transitions shown in red correspond to pe- 
riodic reading errors, and generate reading errors with a fixed, usually 
high probability Perr- This simulates the periodicity property shown in 
Figure 1. Switching from one cyclic submodel to another with a higher 
reading error rate (by adding another red transition, with high error 
probability) can occur at any moment with a fixed probability ps- 

The transitions shown in black in the model from Figure 4 have an 
error emission probability of 0. However, in the complete reading error 
model, wc wish to simulate the error probability that increases towards 
the end (in conformity with Figure 2). We do this by ensuring that read- 
ing errors are generated on these transitions with a probability p'erAv^^) 
(lower than Perr) given by an increasing function of the current position 
pas on the read. Technically, this is achieved by multiplying the automa- 
ton in Figure 4 by a linear automaton with m + 1 states, where m is 
the read length and the i-th transition generates a reading error (color 
mismatch) with the probability p'^^^i). The reading error emission prob- 
ability in the product model is computed as the maximum of the two 
reading error probabilities encountered in the multiplied models. 

The final model, which combines both error sources, is the product 
of MsNP/i and Mre- While the states and transitions of the product 



model are defined in the classic manner, the emissions are defined through 
specific rules based on symbol priorities. If corresponding transitions of 
^SNP/i ^-iid MjiE generate symbols a and /3 with probabilities pi and p2 
respectively, then the product automaton generates the dominant symbol 
between a and /3 with probability piP2- Different probabilities obtained 
in this way for the same symbol are added up. 

The dominance relation is defined as follows: indels are dominant over 
both mismatches and matches, and mismatches dominate matches. For 
example, (indel , mismatch) results in an indel, {mismatch, mi sm^atch) 
and {match, mismatch) represent mismatch, {match, match) is a match. 
This approach ensures that errors generated by each of the two models 
are superposed. 
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3.3 Computing the sensitivity or testing the lossless property 

Given an automaton Q specifying a family of seeds possibly restricted 
to a set of positions, we have to compute its sensitivity (in the lossy 
framework) or to test whether it is lossless (in the lossless framework). 



The sensitivity of a seed family is defined [1,31] as the probabihty for 
at least one of the seeds to hit a read alignment with respect to a given 
probabilistic model of the alignment. As outlined in Section 3.1, this is 
done using the dynamic programming technique of [20] . We therefore omit 
further details. 

In the lossless framework, we have to test if the seed specified by Q is 
lossless, i.e. hits all the target alignments. The set of target alignments is 
defined through a threshold number of allowed mismatches. 

A straightforward way to test the lossless property of Q would be 
to construct a deterministic automaton recognizing the set of all target 
alignments and then to test if the language of this automaton is included 
in the language of Q. This, however, is unfeasible in practice. The automa- 
ton of all target alignments is much too costly to construct: for example, 
in the case of threshold of k mismatches, there are Yla=o (T) different 
alignments of length m, and the Aho-Corasick automaton of these strings 
would have ^^jlg (T) states. Moreover, testing the inclusion would lead 
to computing the product of this automaton with Q which would multiply 
the number of states by that of Q. 

Alternatively, we propose an efficient dynamic programming algo- 
rithm directly applied to Q that can verify the inclusion. This algorithm 
computes, for each state q of Q, and for each iteration i G [l..m], the 
minimal number of mismatches needed to reach q at step i. Let k be 
the threshold for the number of mismatches. Then, the lossless condi- 
tion holds iff at step m, all non-final states have a number of mismatches 
greater than k. Indeed, if there is a non-final state that has a number of 
errors at most k after m steps, then there is at least one string of length m 
with at most k mismatches that is not detected by the automaton, which 
contradicts the lossless condition. This algorithm is of time complexity 
C(|Q| • 1^1 -m), and space complexity 0(1 Q| • |.4|), where A is the alphabet 
of the alignment sequences, in our case {0, 1} 

To illustrate the efficiency of this algorithm, consider the case of a 
single spaced seed of span s and weight w, yielding an automaton with at 
most (w + 1) ■ 2^~'^ states [32, 20]. On this automaton, our method runs 
in time 0{wm2^~^) which brings an improvement by a factor of ^ of 
the general bound 0(m2*) from [33]. 

In the context of color sequence mapping, it is interesting to define the 

lossless property with respect to a maximal number of allowed mismatches 
that is split between SNPs and reading errors. Since, in the color space, a 
SNP appears as two adjacent color mismatches, having k non-consecutive 




SNPs and h color mismatches implies the possibility to accept 2k + h 
mismatches with the additional restriction that there exist at least k pairs 
of adjacent ones. The automaton that recognizes the set of alignments 
verifying this condition on mismatches can be obtained by combining 
simple 3-state building blocks as depicted in Figure 5. An example of 
such an automaton, accepting 1 SNP and 2 reading errors, is illustrated 
in Figure 6 (1 and denote match and mismatch respectively). 

Note that the case of consecutive SNPs, resulting in sequences of ad- 
jacent color mismatches, is a simpler problem (since consecutive SNPs 
produce require less mismatches in the color representation than the same 
number of non-consecutive SNPs) and is covered by the proposed model: 
a seed that is lossless for alignments with non-consecutive SNPs will also 
be lossless for alignments with the same number of consecutive SNPs. 

To verify the lossless property for k SNPs and h color mismatches, 
we intersect the corresponding automaton with the seed automaton (thus 
restricting the set of alignments recognized by the seed to those with 
k SNPs and h color mismatches) and submit the result to the dynamic 
programming algorithm of Section 3.3. 

4 Experiments and Discussion 

We present now several efficient seed designs illustrating our methodology 
(more examples at http://bioinfo.lifl.fr/yass/iedera_solid). 

We first computed several sets of seeds of weight 10, restricted to 
either 10 or 12 positions among the 34 positions of SOLID reads, each 
including one or two seeds. Figure 7 shows some of the resulting seeds, 
together with the corresponding sensitivity values, computed through the 
methods described in Section 3. 

Interestingly, both single seeds 1-Lossy-10p and 1-Lossy-12p con- 
tain a double gap, which may reflect that an SNP modifies two adjacent 
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Fig. 7. Position-restricted seeds for 10 (above) and 12 (below) allowed positions. Dif- 
ferent placements of a seed correspond to the allowed positions. 



colors. However, this gap is not centered but rather shifted at the two- 
third of the seed (as observed for the best single seeds of [19]). Note also 
that in the two-seed families 2-Lossy-10p and 2-Lossy-12p, one of the 
chosen seeds is ungapped. This may be a consequence of the fact that we 
consider indels in our lossy model, which usually forces the seeds to have 
a smaller span. Another interesting observation is that two-sccd families 
2-Lossy-10p and 2-Lossy-12p are actually lossless for the threshold of 
3 mismatches, whereas single seeds 1-LosSY-lOP and 1-Lossy-12p are 
not lossless for this setting. 

We then focused on the lossless case where the maximal number of 
allowed mismatches is split between SNPs and reading errors. Using the 
procedure described in Section 3.3, we computed lossless single and double 
seeds for one SNP and two reading errors. Results are shown in Figure 8. 
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Fig. 8. Lossless position-restricted seeds for 1 SNP and 2 reading errors 



Note that the seed 1-Lossless-14p is one of several single seeds of 
weight 10 we found that satisfied this lossless condition, with no restric- 
tion on allowed positions. Interestingly, they all have a very large span 
(21) and a regular pattern with a periodic structure that can be obtained 
by iterating a simpler pattern solving the lossless problem for an appropri- 
ate cyclic problem, following the property we previously described in [19]. 
For two-seed families, Figure 8 shows a lossless pair of seeds 2-Lossless- 
8p for read length 33 (which then remains lossless for larger lengths), 
where each seed is restricted to apply to four positions only. 

To get a better idea of the sensitivity of the obtained seeds applied to 
real data, we tested them on 100000 reads of length 34 from S. cerevisiae 
and computed the number of read/reference alignments hit by each (single 
or double) seed. Alignments were defined through the score varying from 
28 to 34, under the scoring scheme -|-1 for match, for color mismatch 
or SNP, -2 for gaps. Results are presented in Figure 9. One conclusion we 
can draw is that the performance of lossless seeds 1-Lossless-14p and 
2-Lossless-8p decreases quite fast when the alignment score goes down, 
compared to lossy seeds. Intuitively, this is, in a sense, a price to pay for 
the lossless condition which usually makes these seeds less appropriate for 
the alignments with a number of errors exceeding the threshold. Another 
conclusion is that, as expected, single seeds perform worse than double 
seeds, although the overall number of positions where seeds apply is the 
same for both single and double seeds. 

Alignments hit by each seed famiiy 
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Fig. 9. Number of read alignments with scores between 28 and 34 hit by each seed 

Note finally that the choice of the best seed can be affected, on one 
hand, by different properties of the class of target alignments (number. 
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type and distribution of mismatches and indels etc.) and, on the other 
hand, by the size of the data and the available computational resources. 
The former can be captured by our probabilistic models described in Sec- 
tion 3. The latter is related to the choice of the selectivity level, directly 
affecting the speed of the search, which is defined by the seed weight and 
the number of allowed positions. Depending on the chosen selectivity, dif- 
ferent seeds can (and should) be preferred. Note in this regard that seeds 
appearing in Figure 9 have different selectivity and are then incomparable 
stricto sensu. A comparison of different seeds for SOLID read mapping in 
typical practical situations will be a subject of a separate work. 

5 Conclusions and perspectives 

In this paper, we presented a seed design framework for mapping SOLID 
reads to a reference genomic sequence. Our contributions include the con- 
cept of position-restricted seeds, particularly suitable for short alignments 
with non-uniform error distribution; a model that captures the statisti- 
cal characteristics of the SOLID reads, used for the evaluation of lossy 
seeds; an efficient dynamic programming algorithm for verifying the loss- 
less property of seeds; the ability to distinguish between SNPs and reading 
errors in seed design. 

Our further work will include a more rigorous training of our models 
and in particular a more accurate estimation of involved probabilities, 
possibly using advanced methods of assessing the fit of a model. An- 
other interesting question to study is the design of efficient combined 
lossy/lossless seeds which provide a guarantee to hit all the alignments 
with a specified number of errors and still have a good sensitivity when 
this threshold is exceeded. Computing such seeds, however, could be diffi- 
cult or even unfeasible: for example, lossless seeds tend to have a regular 
structure (see [19]) while best lossy seeds often have asymmetric and 
irregular structure. Finally, we want to define and study a lossless prop- 
erty that incorporates possible indels and not only mismatches (SNPs or 
reading errors) occurring in read alignments. 
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